Chapter 9 Chinese Text Processing
In this chapter, we discuss one of the most important issues in Chinese language/text processing: word segmentation. When we discussed tokenization in Chapter 5, word tokenization in English was easy because word boundaries in English are clearly delimited by whitespace. Chinese, however, has no whitespace between words, which makes word tokenization a serious problem.
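To make the problem concrete, here is a small base-R illustration (not part of the original example): splitting a Chinese string on whitespace returns the whole string unchanged, while splitting on the empty string returns single characters; neither yields word-level tokens.

```r
s <- "綠黨桃園市議員"  # "Green Party Taoyuan City councilor"

# Splitting on whitespace finds nothing to split: the whole string comes back.
strsplit(s, " ", fixed = TRUE)[[1]]

# Splitting on "" gives single characters, not words.
strsplit(s, "")[[1]]
```

This is why a dedicated segmenter such as jiebaR is needed for Chinese.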
This chapter is devoted to Chinese text processing. We will look at the issues of word tokenization and introduce the most widely used library for Chinese word segmentation, jiebaR. We will also include several case studies on Chinese text processing.
library(tidyverse)
library(tidytext)
library(quanteda)
library(stringr)
library(jiebaR)
library(readtext)
9.1 Chinese Word Segmenter jiebaR
9.1.1 Start
First, if you haven’t installed the library jiebaR, you may need to install it manually:
install.packages("jiebaR")
This is the version used for this tutorial:
packageVersion("jiebaR")
## [1] '0.10.99'
Now let us take a look at a quick example. Let us assume that in our corpus, we have collected only one text document, with only a short paragraph.
text <- "綠黨桃園市議員王浩宇爆料,指民眾黨不分區被提名人蔡壁如、黃瀞瑩,在昨(6)日才請辭是為領年終獎金。台灣民眾黨主席、台北市長柯文哲7日受訪時則說,都是按流程走,不要把人家想得這麼壞。"
There are two important steps in Chinese word segmentation:
- initialize a word segmenter object using worker()
- segment the texts using segment()
seg1 <- worker()
segment(text, jiebar = seg1)
## [1] "綠黨" "桃園市" "議員" "王浩宇" "爆料" "指民眾"
## [7] "黨" "不" "分區" "被" "提名" "人"
## [13] "蔡壁如" "黃" "瀞" "瑩" "在昨" "6"
## [19] "日" "才" "請辭" "是" "為領" "年終獎金"
## [25] "台灣民眾" "黨" "主席" "台北" "市長" "柯文"
## [31] "哲" "7" "日" "受訪" "時則" "說"
## [37] "都" "是" "按" "流程" "走" "不要"
## [43] "把" "人家" "想得" "這麼" "壞"
To segment the document text, you first initialize a segmenter, seg1, using worker(), and then feed this segmenter to segment(text, jiebar = seg1), which segments the text into words.
9.1.2 Settings
There are many parameters you can specify when initializing the segmenter with worker(). You can get more details from the documentation (?worker). Some of the important arguments include:
- user = ...: the path to a user-defined dictionary
- stop_word = ...: the path to a stopword list
- symbol = FALSE: whether to return symbols (the default is FALSE)
- bylines = FALSE: whether to return a list (one element per input line) or not
9.1.3 User-defined dictionary
From the above example, it is clear that some words are not correctly identified by the current segmenter: for example, 民眾黨, 不分區, 黃瀞瑩, and 柯文哲. It is always recommended to include a user-defined dictionary when doing word segmentation, because different corpora may have their own unique vocabulary. This can be done when you initialize the segmenter with worker().
seg2 <- worker(user = "demo_data/dict-ch-user-demo.txt")
segment(text, jiebar = seg2)
## [1] "綠黨" "桃園市" "議員" "王浩宇" "爆料" "指"
## [7] "民眾黨" "不分區" "被" "提名" "人" "蔡壁如"
## [13] "黃瀞瑩" "在昨" "6" "日" "才" "請辭"
## [19] "是" "為領" "年終獎金" "台灣" "民眾黨" "主席"
## [25] "台北" "市長" "柯文哲" "7" "日" "受訪"
## [31] "時則" "說" "都" "是" "按" "流程"
## [37] "走" "不要" "把" "人家" "想得" "這麼"
## [43] "壞"
The format of the user-defined dictionary is a text file, with one word per line. Also, the default encoding of the dictionary is UTF-8. Please note that in Windows, the default encoding of a txt file created by Notepad may not be UTF-8.
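As a minimal sketch (the file name and the word list here are illustrative), such a dictionary can be written from within R, forcing UTF-8 encoding so that it also works correctly on Windows:

```r
# Words missed by the default dictionary in the example above
my_words <- c("民眾黨", "不分區", "黃瀞瑩", "柯文哲")

# Write one word per line, explicitly UTF-8 encoded
con <- file("my-user-dict.txt", open = "w", encoding = "UTF-8")
writeLines(my_words, con)
close(con)

# Read the file back to verify its content
readLines("my-user-dict.txt", encoding = "UTF-8")
```

The resulting file can then be passed to worker(user = "my-user-dict.txt").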
Creating a user-defined dictionary may take a lot of time. You may consult 搜狗詞庫 (the Sogou dictionary collection), which includes many domain-specific dictionaries created by others. However, note that these dictionaries come in the .scel format; you may need to convert the .scel files to .txt before using them with jiebaR. To do the conversion automatically, please consult the library cidian. You may also need to convert between traditional and simplified Chinese; for this, you may consult the R library ropencc.
9.1.4 Stopwords
When you initialize the segmenter, you can also specify a stopword list, i.e., words you do not want to include in later analyses. For example, in text mining, function words are usually less informative.
seg3 <- worker(stop_word = "demo_data/stopwords-ch-demo.txt")
segment(text, jiebar = seg3)
## [1] "綠黨" "桃園市" "議員" "王浩宇" "爆料" "指民眾"
## [7] "黨" "不" "分區" "被" "提名" "人"
## [13] "蔡壁如" "黃" "瀞" "瑩" "在昨" "6"
## [19] "才" "請辭" "為領" "年終獎金" "台灣民眾" "黨"
## [25] "主席" "台北" "市長" "柯文" "哲" "7"
## [31] "受訪" "時則" "說" "按" "流程" "走"
## [37] "不要" "把" "人家" "想得" "這麼" "壞"
9.1.5 POS Tagging
So far we haven’t seen the part-of-speech tags provided by the word segmenter. If you need the POS tags of the words, specify the argument type = "tag" when you initialize the worker().
seg4 <- worker(type = "tag", user = "demo_data/dict-ch-user-demo.txt", stop_word = "demo_data/stopwords-ch-demo.txt")
segment(text, seg4)
## n ns n x n n x
## "綠黨" "桃園市" "議員" "王浩宇" "爆料" "指" "民眾黨"
## x p v n x x x
## "不分區" "被" "提名" "人" "蔡壁如" "黃瀞瑩" "在昨"
## x d v x n x x
## "6" "才" "請辭" "為領" "年終獎金" "台灣" "民眾黨"
## n ns n x x v x
## "主席" "台北" "市長" "柯文哲" "7" "受訪" "時則"
## zg p n v df p n
## "說" "按" "流程" "走" "不要" "把" "人家"
## x r a
## "想得" "這麼" "壞"
The POS tags used by jiebaR follow the jieba tagset; for example, n marks common nouns, ns place names, v verbs, r pronouns, a adjectives, d adverbs, p prepositions, and x unidentified tokens.
9.1.6 Default
You can check the dictionaries and the stopword list used by jiebaR in your current environment:
show_dictpath()
## [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/jiebaRD/dict"
dir(show_dictpath())
## [1] "backup.rda" "hmm_model.utf8" "hmm_model.zip" "idf.utf8"
## [5] "idf.zip" "jieba.dict.utf8" "jieba.dict.zip" "model.rda"
## [9] "README.md" "stop_words.utf8" "user.dict.utf8"
scan(file = "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/jiebaRD/dict/stop_words.utf8",
     what = character(), nlines = 50, sep = '\n',
     encoding = 'utf-8', fileEncoding = 'utf-8')
## [1] "\"" "." "。" "," "、" "!" "?" ":" ";" "`" "﹑" "•"
## [13] "＂" "^" "…" "‘" "’" "“" "”" "〝" "〞" "~" "\\" "∕"
## [25] "|" "¦" "‖" "— " "(" ")" "〈" "〉" "﹞" "﹝" "「" "」"
## [37] "‹" "›" "〖" "〗" "】" "【" "»" "«" "』" "『" "〕" "〔"
## [49] "》" "《"
9.1.7 Reminder
When we use segment() as the tokenization method in unnest_tokens(), it is very important to specify bylines = TRUE in worker(). This setting makes sure that segment() takes a character vector of texts as input and returns a list of word vectors of the same length.
NB: When bylines = FALSE, segment() returns a vector.
seg_byline_1 <- worker(bylines = T)
seg_byline_0 <- worker(bylines = F)
(text_tag_1 <- segment(text, seg_byline_1))
## [[1]]
## [1] "綠黨" "桃園市" "議員" "王浩宇" "爆料" "指民眾"
## [7] "黨" "不" "分區" "被" "提名" "人"
## [13] "蔡壁如" "黃" "瀞" "瑩" "在昨" "6"
## [19] "日" "才" "請辭" "是" "為領" "年終獎金"
## [25] "台灣民眾" "黨" "主席" "台北" "市長" "柯文"
## [31] "哲" "7" "日" "受訪" "時則" "說"
## [37] "都" "是" "按" "流程" "走" "不要"
## [43] "把" "人家" "想得" "這麼" "壞"
(text_tag_0 <- segment(text, seg_byline_0))
## [1] "綠黨" "桃園市" "議員" "王浩宇" "爆料" "指民眾"
## [7] "黨" "不" "分區" "被" "提名" "人"
## [13] "蔡壁如" "黃" "瀞" "瑩" "在昨" "6"
## [19] "日" "才" "請辭" "是" "為領" "年終獎金"
## [25] "台灣民眾" "黨" "主席" "台北" "市長" "柯文"
## [31] "哲" "7" "日" "受訪" "時則" "說"
## [37] "都" "是" "按" "流程" "走" "不要"
## [43] "把" "人家" "想得" "這麼" "壞"
class(text_tag_1)
## [1] "list"
class(text_tag_0)
## [1] "character"
9.2 Chinese Text Analytics Pipeline
In Chapter 5, we talked about the pipeline for processing English texts, as shown below:
For Chinese texts, the workflow is pretty much the same. The most important trick is in the tokenization step, i.e., unnest_tokens(): we need to supply our own tokenizer via the argument token = ... in unnest_tokens().
It is important to note that when we specify a self-defined token function, this function should take a character vector (i.e., a text-based vector) and return a list of character vectors (i.e., word-based vectors) of the same length.
In other words, when initializing the Chinese word segmenter, we need to specify the argument bylines = TRUE, i.e., worker(bylines = TRUE).
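This input/output contract can be illustrated independently of jiebaR with a toy tokenizer (strsplit here merely stands in for segment(); the function name is illustrative):

```r
# A toy tokenizer satisfying the contract expected by unnest_tokens():
# input: a character vector of N texts;
# output: a list of N character vectors (one vector of tokens per text).
toy_tokenizer <- function(x) strsplit(x, " ", fixed = TRUE)

out <- toy_tokenizer(c("a b c", "d e"))
length(out)  # same length as the input vector
```

Any function with this shape, including segment() from a worker(bylines = TRUE) segmenter, can be passed to token = ... in unnest_tokens().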

So, based on our simple-corpus example above, we can create:
# a text-based tidy corpus
a_small_tidy_corpus <- text %>% corpus %>% tidy %>% mutate(textID = row_number())
a_small_tidy_corpus
# initialize segmenter
my_seg <- worker(bylines = T, user = "demo_data/dict-ch-user-demo.txt")
# tokenization
a_small_tidy_corpus_by_word <- a_small_tidy_corpus %>%
unnest_tokens(word, text, token = function(x) segment(x, jiebar = my_seg))
a_small_tidy_corpus_by_word
In the following sections, we look at a few more case studies of Chinese text processing, using news articles collected from Apple News as our example corpus. The dataset is available in our course Dropbox drive: demo_data/applenews10000.tar.gz.
9.3 Case Study 1: Word Frequency and Wordcloud
We follow the same steps as illustrated in the flowchart above:
- create a text-based tidy corpus object apple_df (i.e., a tibble)
- initialize a word segmenter using worker()
- tokenize the corpus into a word-based tidy corpus object using unnest_tokens()
# loading the corpus
# NB: this may take some time
apple_df <- readtext("demo_data/applenews10000.tar.gz") %>%
as_tibble() %>%
filter(text !="") %>%
mutate(doc_id = row_number())
apple_df